Abstract

We investigate the posterior rate of convergence for wavelet shrinkage using a
Bayesian approach in general Besov spaces. Instead of studying the Bayesian
estimator related to a particular loss function, we focus on the posterior
distribution itself from a nonparametric Bayesian asymptotics point of view and
study its rate of convergence. We obtain the same rate, up to a logarithmic
factor, as in Abramovich et al. (2004), where the authors studied the
convergence of several Bayesian estimators.

1 Introduction



Infinite-dimensional Bayesian methods have become quite popular recently,
owing to both computational and theoretical advances in this field. There
are many results concerning posterior convergence under appropriate priors.
These developments originate from the study of density estimation
problems. In these problems, given the prior $\Pi_n$ on the set $\mathcal{P}$ of
probability distributions, the posterior is a random measure:

\[
\Pi_n(B \mid X_1, \ldots, X_n)
  = \frac{\int_B \prod_{i=1}^n p(X_i) \, d\Pi_n(P)}
         {\int_{\mathcal{P}} \prod_{i=1}^n p(X_i) \, d\Pi_n(P)} , \tag{1}
\]

where $p$ denotes the density of $P$. We say that the posterior is consistent if,
for every $\epsilon > 0$,

\[
\Pi_n(P \in \mathcal{P} : d(P, P_0) > \epsilon \mid X_1, \ldots, X_n) \to 0
  \quad \text{in } P_0^n \text{ probability},
\]

where $P_0$ is the true distribution and $d$ is some suitable distance function
between probability measures.

To study rates of convergence, let $\epsilon_n$ be a sequence decreasing to zero.
We say the rate is at least $\epsilon_n$ if, for a sufficiently large constant $M$,

\[
\Pi_n(P : d(P, P_0) > M \epsilon_n \mid X_1, \ldots, X_n) \to 0
  \quad \text{in } P_0^n \text{ probability}.
\]



It turns out the convergence rates are closely related to the existence of tests 
that separate the hypotheses in convex sets. 



The most general result appeared recently in Ghosal and van der Vaart (2007),
where the formulation includes both density estimation and regression
problems, and the results also extend to non-i.i.d. cases such as stationary and
nonstationary sequences of observations. In this general context, the definition
of the posterior convergence rate is similar, except that the measures $P$ and
$P_0$ represent the distribution of the data and thus depend on the sample size
$n$, and the observations are no longer i.i.d., so the likelihood used in (1) must
be changed to a more general form accordingly.

Another relatively recent development in statistics is the investigation of
wavelet methods, which have found numerous applications in engineering as well.
There are many theoretical results explaining why the wavelet transformation is
effective, from both the frequentist and the Bayesian point of view. These
well-known results include the now widely celebrated works of David Donoho and
his collaborators (Donoho and Johnstone, 1994; Donoho et al., 1996). The property
that distinguishes these works from previous results is that a single estimator
can achieve the minimax rate over a range of function spaces, including
functions with inhomogeneous smoothness, whose minimax rate cannot be
achieved by simpler linear estimators. The sparsity of the coefficients of
the function in an appropriate basis is the key to the success of the wavelet
thresholding approach.

The Bayesian approach to function estimation in Besov spaces has been
investigated in Abramovich et al. (1998, 2004). In these approaches, after
specifying an appropriate prior, the Bayesian estimator is obtained from the
posterior and investigated from the frequentist point of view. In particular,
they study the rate of convergence of different point estimators, including the
posterior mean and posterior median as well as other estimators derived from the
posterior distribution. The theoretical results in Abramovich et al. (2004) show
that some Bayesian estimators can achieve better-than-linear rates if an
appropriate prior is chosen that implicitly implements a shrinkage or
thresholding rule similar to the frequentist approach.

In a Bayesian framework, most researchers are more interested in the posterior
as a distribution than in point estimates derived from a specific loss
function. The convergence of the posterior distribution in this context has not
been studied. This paper intends to fill this gap. Using the same prior as in
Abramovich et al. (1998, 2004), we show that the posterior distribution has
the same convergence rate, up to a logarithmic factor, as the point estimators
proposed in those papers.



We describe the model and present the main theorem in Section 2. Some
possible extensions of our result are discussed in the final section.



2 Main result 



Consider the white noise model

\[
dX(t) = f(t) \, dt + \sigma_n \, dW(t), \tag{2}
\]

where $\sigma_n^2 = 1/n$, $f \in B^s_{p,q}[0,1]$ and $W$ is the standard Brownian
motion. Using a wavelet basis on $[0,1]$ with sufficient regularity, the function
$f$ can be expanded as

\[
f = \sum_{k=0}^{2^{j_0}-1} \alpha_{j_0 k} \phi_{j_0 k}
  + \sum_{j \ge j_0} \sum_{k=0}^{2^j-1} \beta_{jk} \psi_{jk},
\]

where $\phi_{j_0 k}$ are the scaling functions and $\psi_{jk}$ are the mother
wavelets at resolution $j$, and $j_0$ is the lowest resolution in the expansion.
We assume $j_0 = 0$ for simplicity of notation below.

The Besov spaces include the well-known Sobolev and Hölder classes of functions
and also nearly contain the space of functions of bounded variation. The
norm for the Besov space with parameters $s > \max(0, 1/p - 1/2)$,
$1 \le p \le \infty$ and $1 \le q \le \infty$ is defined as

\[
\|f\|_{B^s_{p,q}} = \|P_0(f)\|_{L_p}
  + \Big( \sum_{j \ge 0} \big( 2^{js} \|Q_j(f)\|_{L_p} \big)^q \Big)^{1/q},
\]

where $P_0(f) = \alpha_{00} \phi_{00}$ is the projection of $f$ on the
"approximation space", and $Q_j(f) = \sum_{k=0}^{2^j-1} \beta_{jk} \psi_{jk}$ is the
projection of $f$ onto the "detail space".

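In terms of the coefficients in the wavelet expansion, the Besov norm can be
equivalently defined by

\[
\|\beta\|_{B^s_{p,q}} = |\alpha_{00}|
  + \Big( \sum_{j=0}^{\infty} 2^{j(s+1/2-1/p)q} \|\beta_{j\cdot}\|_p^q \Big)^{1/q},
\]

where $\beta_{j\cdot} = (\beta_{jk})_k$ collects the coefficients at resolution $j$.
Note that for cases where $q = \infty$ the usual change to the sup norm is required.
By abuse of notation, we also define $P_J \beta$ to be the sequence $\beta'$ such
that $\beta'_{jk} = \beta_{jk}$ when $j \le J$ and $\beta'_{jk} = 0$ when $j > J$.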
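As a small illustration of the sequence-space norm and of the truncation $P_J$, the following Python sketch evaluates both for finite $p$ and $q$. The function names `besov_seq_norm` and `project`, and the list-of-levels layout of the coefficients, are assumptions made for this example only; the norm is taken in the standard form $|\alpha_{00}| + (\sum_j 2^{j(s+1/2-1/p)q} \|\beta_{j\cdot}\|_p^q)^{1/q}$.

```python
import math

def besov_seq_norm(alpha00, beta, s, p, q):
    """Sequence-space Besov norm of wavelet coefficients (finite p and q).

    beta is a list of levels: beta[j] holds the 2**j detail coefficients
    at resolution j (a hypothetical layout chosen for this sketch).
    """
    total = 0.0
    for j, level in enumerate(beta):
        # l_p norm of the coefficients at level j
        lp = sum(abs(b) ** p for b in level) ** (1.0 / p)
        # level weight 2^{j(s + 1/2 - 1/p)}, raised to the power q
        total += (2.0 ** (j * (s + 0.5 - 1.0 / p)) * lp) ** q
    return abs(alpha00) + total ** (1.0 / q)

def project(beta, J):
    """P_J beta: keep levels j <= J and set levels j > J to zero."""
    return [list(level) if j <= J else [0.0] * len(level)
            for j, level in enumerate(beta)]

beta = [[1.0], [0.0, 2.0]]
print(besov_seq_norm(0.0, beta, s=1.0, p=2.0, q=2.0))
print(project(beta, 0))
```

For $p = q = 2$ the norm reduces to a weighted $\ell_2$ norm with weights $2^{js}$, which is the Sobolev-type special case.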

The white noise model (2) is closely related to the nonparametric regression
model (Brown and Low, 1996; Donoho et al., 1995):



\[
Y_i = f(i/n) + \epsilon_i, \quad i = 1, \ldots, n,
\]



with standard normal noise. We choose to work with (2) for its simplicity of
formulation.

After the wavelet transformation of (2), we get the Gaussian sequence model:

\[
X_{00} = \alpha^0_{00} + Z_{00}/\sqrt{n}, \qquad
X_{jk} = \beta^0_{jk} + Z_{jk}/\sqrt{n}, \quad j \ge 0, \ k = 0, 1, \ldots, 2^j - 1,
\]

where the superscript $0$ indicates the true parameter.



Using the Bayesian approach for Gaussian sequence estimation, we put a prior on
the coefficients $\beta_{jk}$:

\[
\beta_{jk} \sim \pi_j N(0, \tau_j^2) + (1 - \pi_j) \delta_0 \tag{3}
\]

with hyperparameters $\tau_j^2 \propto 2^{-\alpha j}$ and
$\pi_j \propto 2^{-\gamma j}$, for some $\alpha > 1$ and $\gamma > 0$. This
prior was proposed in Abramovich et al. (1998), and Abramovich et al. (2004)
investigated the optimality of some Bayesian estimators with this prior. The
choice of the hyperparameters must satisfy some conditions for the prior to
put positive mass on $B^s_{p,q}$ (Abramovich et al., 1998, Theorem 1), although
this is not our focus here. In the following we assume $\alpha$ and $\gamma$ satisfy
these conditions. We also assume the value $\alpha_{00}$ is known for simplicity,
which does not affect our asymptotic result.
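To make the prior (3) concrete, the sketch below computes the posterior of a single coefficient given one observation $X_{jk} = x$ from the sequence model with noise variance $1/n$. Because the prior is a mixture of a point mass at zero and a normal component, the posterior is again such a mixture, and its weight follows from the posterior odds of the two components. The function name `posterior_summary` is hypothetical, and the choices $\tau_j^2 = 2^{-\alpha j}$ and $\pi_j = \min(1, 2^{-\gamma j})$ (proportionality constants set to 1) are assumptions made for illustration.

```python
import math

def posterior_summary(x, n, j, alpha, gamma):
    """Posterior of one coefficient beta_jk given X_jk = x under prior (3).

    Assumes tau_j^2 = 2**(-alpha*j) and pi_j = min(1, 2**(-gamma*j)),
    i.e. proportionality constants equal to 1 (illustrative choice).
    Returns (posterior probability that beta_jk != 0, posterior mean).
    """
    sigma2 = 1.0 / n                      # noise variance of X_jk
    tau2 = 2.0 ** (-alpha * j)            # variance of the normal component
    pi_j = min(1.0, 2.0 ** (-gamma * j))  # prior weight of the normal component
    shrink = tau2 / (tau2 + sigma2)       # linear shrinkage factor
    if pi_j >= 1.0:
        return 1.0, shrink * x
    # Posterior odds of the point mass at zero against the normal component:
    # ratio of the N(0, sigma2) and N(0, sigma2 + tau2) marginal densities at x.
    odds = ((1.0 - pi_j) / pi_j) * math.sqrt(1.0 + tau2 / sigma2) \
        * math.exp(-shrink * x * x / (2.0 * sigma2))
    p_nonzero = 1.0 / (1.0 + odds)
    return p_nonzero, p_nonzero * shrink * x

print(posterior_summary(0.0, n=100, j=3, alpha=2.0, gamma=1.0))
print(posterior_summary(1.0, n=100, j=3, alpha=2.0, gamma=1.0))
```

A small observation pushes the posterior weight toward the point mass at zero, while a large one is kept and only linearly shrunk; this is the implicit thresholding behaviour referred to in the introduction.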



We also consider the sieve prior as in Shen and Wasserman (2001), and define
the prior $\Pi_n$ by $\Pi_n(A) = \sum_m \lambda_m \Pi_n^m(A)$, where
$\lambda_m \propto 2^{-\mu m}$ for some $\mu > 0$, and $\Pi_n^m$ is a prior on
$\beta_{jk}$ such that $\beta_{jk} \sim N(0, 2^{-\alpha j})$ when $j \le m$ and
$\beta_{jk} = 0$ when $j > m$.
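The two-stage structure of the sieve prior can be sketched as follows: first draw the cut-off level $m$ with probability proportional to $2^{-\mu m}$, then draw independent normal coefficients up to level $m$ and set the rest to zero. The function name `sample_sieve_prior` and the finite cap `max_level` on $m$ are assumptions introduced so that the example runs; the prior in the text places no such cap on $m$.

```python
import random

def sample_sieve_prior(alpha, mu, max_level, rng):
    """Draw (m, beta) from a sieve prior truncated at max_level.

    The truncation of m to {0, ..., max_level} is an assumption of this sketch.
    """
    levels = range(max_level + 1)
    weights = [2.0 ** (-mu * m) for m in levels]   # lambda_m up to normalization
    m = rng.choices(levels, weights=weights)[0]
    beta = []
    for j in levels:
        if j <= m:
            sd = 2.0 ** (-alpha * j / 2.0)         # standard deviation at level j
            beta.append([rng.gauss(0.0, sd) for _ in range(2 ** j)])
        else:
            beta.append([0.0] * (2 ** j))          # coefficients above m are zero
    return m, beta

rng = random.Random(0)
m, beta = sample_sieve_prior(alpha=2.0, mu=1.0, max_level=5, rng=rng)
print(m, [len(level) for level in beta])
```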



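The main result we obtain in this paper is the following:

Theorem 1 Consider a bounded subset of the Besov space:
$B^s_{p,q}(B) = \{\beta \in B^s_{p,q}[0,1] : \|\beta\|_{B^s_{p,q}} < B\}$ and
$\beta^0 \in B^s_{p,q}(B)$. Let $\alpha = 2s + 1$ for $p \ge 2$, and
$\alpha = 2s + 2 - 2/p$ for $1 \le p < 2$. Then for a sufficiently large constant
$M$, under the prior (3) we have

\[
\Pi_n\Big( \{\beta_{jk}\} : \sum_{j,k} (\beta_{jk} - \beta^0_{jk})^2 > M \epsilon_n^2
  \,\Big|\, \{X_{jk}\} \Big) \to 0 \quad \text{in probability},
\]

where $\epsilon_n^2 = (\log n)^2 n^{-2s/(2s+1)}$ when $p \ge 2$, and
$\epsilon_n^2 = (\log n)^2 n^{-(2s+1-2/p)/(2s+2-2/p)}$ when $1 \le p < 2$.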
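The rate in Theorem 1 is easy to evaluate numerically. The helper below, whose name `rate` is an assumption of this sketch, returns the hyperparameter $\alpha$ and the squared rate $\epsilon_n^2$ for given $n$, $s$ and $p$, following the two cases of the theorem.

```python
import math

def rate(n, s, p):
    """Return (alpha, eps_n^2) for the two cases of Theorem 1 (p >= 1)."""
    if p >= 2:
        alpha = 2 * s + 1
        exponent = 2 * s / (2 * s + 1)
    else:
        alpha = 2 * s + 2 - 2 / p
        exponent = (2 * s + 1 - 2 / p) / (2 * s + 2 - 2 / p)
    return alpha, math.log(n) ** 2 * n ** (-exponent)

for p in (1.0, 1.5, 2.0, 4.0):
    alpha, eps2 = rate(10 ** 6, s=1.0, p=p)
    print(p, alpha, eps2)
```

In both cases $n \epsilon_n^2 = (\log n)^2 n^{1/\alpha}$, so the exponent in the prior mass and entropy conditions grows like the number of coefficients up to level $(\log_2 n)/\alpha$ times a logarithmic factor.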



Remark: The above rate of convergence is the same as in Abramovich et al.
(2004) for the posterior mean and posterior median, except for an extra log
factor in our case, which we think might be an artifact of our proofs.

Proof of Theorem 1. In the proof, we use $C$ to denote a generic constant whose
value can change in different locations. We make use of the general result for
the Bayesian posterior rate of convergence (Theorem 6 in Ghosal and van der Vaart
(2007)), although we only use a simpler version which corresponds to Theorem
2.1 in Ghosal et al. (2000) in the i.i.d. case. Two conditions of the theorems
must be verified:

(I) $\log D(\epsilon_n, B^s_{p,q}(B), \|\cdot\|_2) \le n \epsilon_n^2$, where
$D(\epsilon, F, \|\cdot\|)$ is the $\epsilon$-covering number of the space $F$ with
norm $\|\cdot\|$.

(II) $\tilde{\Pi}_n(\beta \in B^s_{p,q}(B) : \|\beta - \beta^0\|_2^2 \le \epsilon_n^2)
\ge \exp\{-C n \epsilon_n^2\}$, where $\tilde{\Pi}_n$ denotes the prior distribution
as in (3) constrained to $B^s_{p,q}(B)$ by renormalization.



Corollary 2 in Nickl and Pötscher (2007) gives the bracketing entropy number
for Besov spaces as $H_B(\epsilon, B^s_{p,q}(B), \|\cdot\|_2) \lesssim \epsilon^{-1/s}$.
Since the bracketing entropy number is an upper bound for the usual entropy, and
$\epsilon_n^{-1/s} \le n \epsilon_n^2$ for all sufficiently large $n$, the sequence
$\epsilon_n$ defined in the statement of the theorem satisfies condition (I).
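The comparison behind condition (I) can be checked numerically. The sketch below compares the entropy bound $\epsilon_n^{-1/s}$ with $n \epsilon_n^2$ for the rate of Theorem 1, ignoring multiplicative constants; the function name `condition_one_holds` and the shorthand `s_eff` (equal to $s$ for $p \ge 2$ and to $s + 1/2 - 1/p$ otherwise) are assumptions of this example.

```python
import math

def condition_one_holds(n, s, p):
    """Check eps_n^(-1/s) <= n * eps_n^2, with constants ignored."""
    s_eff = s if p >= 2 else s + 0.5 - 1.0 / p
    # eps_n = (log n) * n^{-s_eff / (2 s_eff + 1)}, the rate of Theorem 1
    eps = math.log(n) * n ** (-s_eff / (2 * s_eff + 1))
    return eps ** (-1.0 / s) <= n * eps ** 2

print([condition_one_holds(n, s=1.0, p=1.5) for n in (10, 10 ** 3, 10 ** 6)])
```

The inequality is asymptotic: it can fail for very small $n$, where $\log n < 1$ makes $\epsilon_n$ artificially small.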

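Condition (II) is verified as follows:

Since $\beta^0 \in B^s_{p,q}(B)$, there exists $\delta > 0$ such that
$\|\beta^0\|_{B^s_{p,q}} < B - \delta$. Let $J = (\log_2 n)/\alpha$. For the
renormalized prior $\tilde{\Pi}_n$ of condition (II), we have

\[
\begin{aligned}
\tilde{\Pi}_n(\|\beta - \beta^0\|_2^2 \le \epsilon_n^2)
&\ge \Pi_n(\|\beta - \beta^0\|_2^2 \le \epsilon_n^2, \ \|\beta\|_{B^s_{p,q}} < B) \\
&\ge \Pi_n\Big( \sum_{j=0}^{J} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le \epsilon_n^2/2, \
      \|P_J \beta\|_{B^s_{p,q}} < B - \delta/2 \Big) \\
&\quad \cdot \Pi_n\Big( \sum_{j=J+1}^{\infty} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le \epsilon_n^2/2, \
      \|\beta - P_J \beta\|_{B^s_{p,q}} < \delta/2 \Big).
\end{aligned}
\]

The above two terms are dealt with in the following two lemmas, which provide
a lower bound of $e^{-C n \epsilon_n^2}$, and the theorem is proved. □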

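Lemma 1 $\Pi_n\big( \sum_{j=J+1}^{\infty} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le \epsilon_n^2/2, \
\|\beta - P_J \beta\|_{B^s_{p,q}} < \delta/2 \big)$ is bounded away from $0$.

Proof. Since $\sum_{j>J} \sum_k (\beta^0_{jk})^2 \le \sum_{j>J} C 2^{-2js'} \le \epsilon_n^2/8$
for large $n$, where $s' = s$ for $p \ge 2$ and $s' = s + 1/2 - 1/p$ otherwise, we
have, using Markov's inequality in the third step,

\[
\begin{aligned}
\Pi_n\Big( \sum_{j=J+1}^{\infty} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le \epsilon_n^2/2 \Big)
&\ge \Pi_n\Big( \sum_{j>J,k} \beta_{jk}^2 \le \epsilon_n^2/8 \Big) \\
&= 1 - \Pi_n\Big( \sum_{j>J,k} \beta_{jk}^2 > \epsilon_n^2/8 \Big) \\
&\ge 1 - C \cdot 2^{-(\alpha-1)J} / \epsilon_n^2 \\
&\to 1.
\end{aligned}
\]

On the other hand, $\Pi_n(\|\beta - P_J \beta\|_{B^s_{p,q}} < \delta/2) \ge
\Pi_n(\|\beta\|_{B^s_{p,q}} < \delta/2) =: t > 0$ when $\alpha$ and $\gamma$ are
chosen appropriately such that $\Pi_n(B^s_{p,q}) > 0$ (this is possible by
Abramovich et al. (1998)).

Thus $\Pi_n\big( \sum_{j=J+1}^{\infty} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le \epsilon_n^2/2, \
\|\beta - P_J \beta\|_{B^s_{p,q}} < \delta/2 \big) \ge t - o(1)$, which is bounded
away from $0$ as $n \to \infty$. □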



Lemma 2

\[
\Pi_n\Big( \sum_{j=0}^{J} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le \epsilon_n^2/2, \
  \|P_J \beta\|_{B^s_{p,q}} < B - \delta/2 \Big) \ge e^{-C n \epsilon_n^2}.
\]

Proof. This probability can be bounded from below using the techniques in
Section 5 of Shen and Wasserman (2001).

First we show

\[
\Pi_n\Big( \sum_{j=0}^{J} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le \epsilon_n^2/2, \
  \|P_J \beta\|_{B^s_{p,q}} < B - \delta/2 \Big)
\ge \Pi_n\Big( \sum_{j=0}^{J} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le c^2 r_n^2 / \log n \Big) \tag{4}
\]

for a small enough constant $c$, where $r_n = n^{-(s+1/2-1/p)/(2s+1)}$ when $p \ge 2$
and $r_n = n^{-s/(2s+2-2/p)}$ when $1 \le p < 2$. Notice we obviously have
$r_n = o(\epsilon_n)$.

Case 1: $p \ge 2$.

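Note $\|\beta_{j\cdot}\|_p \le \|\beta_{j\cdot}\|_2$ when $p \ge 2$. Conditioned on the
event $\sum_{j=0}^{J} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le c^2 r_n^2 / \log n$, the
$B^s_{p,q}$ norm of $P_J \beta - P_J \beta^0$ can be bounded as follows:

\[
\begin{aligned}
\|P_J \beta - P_J \beta^0\|_{B^s_{p,q}}
&= \Big( \sum_{j=0}^{J} 2^{jq(s+1/2-1/p)} \|\beta_{j\cdot} - \beta^0_{j\cdot}\|_p^q \Big)^{1/q} \\
&\le \Big( \sum_{j=0}^{J} 2^{jq(s+1/2-1/p)} \|\beta_{j\cdot} - \beta^0_{j\cdot}\|_2^q \Big)^{1/q} \\
&\le 2^{J(s+1/2-1/p)} \Big( \sum_{j=0}^{J} \|\beta_{j\cdot} - \beta^0_{j\cdot}\|_2^q \Big)^{1/q} \\
&\le 2^{J(s+1/2-1/p)} (J+1)^{\max(1/q-1/2,0)} \|P_J \beta - P_J \beta^0\|_2 \\
&\le C c \, n^{(s+1/2-1/p)/(2s+1)} r_n = C c.
\end{aligned}
\]

Case 2: $1 \le p < 2$.

Since $\|\beta_{j\cdot}\|_p \le 2^{j(1/p-1/2)} \|\beta_{j\cdot}\|_2$ when $1 \le p < 2$, the
norm of $P_J \beta - P_J \beta^0$ can be bounded as follows:

\[
\begin{aligned}
\|P_J \beta - P_J \beta^0\|_{B^s_{p,q}}
&= \Big( \sum_{j=0}^{J} 2^{jq(s+1/2-1/p)} \|\beta_{j\cdot} - \beta^0_{j\cdot}\|_p^q \Big)^{1/q} \\
&\le \Big( \sum_{j=0}^{J} 2^{jqs} \|\beta_{j\cdot} - \beta^0_{j\cdot}\|_2^q \Big)^{1/q} \\
&\le 2^{Js} \Big( \sum_{j=0}^{J} \|\beta_{j\cdot} - \beta^0_{j\cdot}\|_2^q \Big)^{1/q} \\
&\le C c \, n^{s/(2s+2-2/p)} r_n = C c.
\end{aligned}
\]

Summarizing the above two cases, $\|P_J \beta - P_J \beta^0\|_{B^s_{p,q}}$ will be less
than $\delta/2$ when $c$ is sufficiently small, and (4) is proved by noticing
$\|P_J \beta\|_{B^s_{p,q}} \le \|P_J \beta - P_J \beta^0\|_{B^s_{p,q}} + \|P_J \beta^0\|_{B^s_{p,q}}
< \delta/2 + (B - \delta) = B - \delta/2$, together with
$c^2 r_n^2 / \log n \le \epsilon_n^2/2$ for large $n$.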

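What is left is to lower bound
$\Pi_n\big( \sum_{j=0}^{J} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le c^2 r_n^2 / \log n \big)$.

Obviously the above prior probability is smallest when $\pi_j = 1$ and thus the
prior is a normal distribution. Let $\delta_n^2 = c^2 r_n^2 / \log n$ for simplicity
of notation.

If we denote by $K = \sum_{j=0}^{J} 2^j \le (\log_2 n) n^{1/\alpha}$ the total number of
variables with $0 \le j \le J$, and let
$A = \exp\{-\sum_{j=0}^{J} \sum_k 2^{\alpha j} (\beta^0_{jk})^2\}$,
$\Lambda = \{w_{jk} : 0 \le j \le J, \ 0 \le k \le 2^j - 1, \ \|w\|_2^2 \le \delta_n^2\}$,
we have

\[
\begin{aligned}
\Pi_n\Big( \sum_{j=0}^{J} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le \delta_n^2 \Big)
&\ge A \Big( \frac{1}{2\pi} \Big)^{K/2} \prod_{j=0}^{J} (2^{\alpha j/2})^{2^j}
   \int_{\Lambda} \exp\Big\{ -\sum_{j=0}^{J} \sum_k 2^{\alpha j} w_{jk}^2 \Big\} \prod_{j,k} dw_{jk} \\
&\ge A \Big( \frac{1}{2} \Big)^{K/2} \prod_{j=0}^{J} (2^{\alpha j/2})^{2^j}
   \frac{1}{\Gamma(K/2)} \int_0^{\delta_n^2} u^{K/2-1} \exp\{-2^{\alpha J} u\} \, du \\
&= A \Big( \frac{1}{2} \Big)^{K/2} \prod_{j=0}^{J} (2^{\alpha j/2})^{2^j} \, 2^{-\alpha J K/2}
   \frac{1}{\Gamma(K/2)} \int_0^{2^{\alpha J} \delta_n^2} x^{K/2-1} e^{-x} \, dx \\
&\ge e^{-C n \epsilon_n^2}.
\end{aligned}
\]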



In the above we used Lemma 3 in Shen and Wasserman (2001) as well as
the inequality F{b;a) := p^/Q^a;"~^e~^(ix ^ e°'e~^b°'a~°'a~^^^ which also 
appeared in that paper. □ 

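If we use the sieve prior presented right before Theorem 1, the same conclusion
still holds.

Theorem 2 The result of Theorem 1 is still true with the sieve prior.

Proof. The entropy bound for condition (I) is unchanged. With the same
$J = (\log_2 n)/\alpha$, and with $\tilde{\Pi}_n$ denoting the sieve prior renormalized
to $B^s_{p,q}(B)$, we have

\[
\begin{aligned}
\tilde{\Pi}_n(\|\beta - \beta^0\|_2^2 \le \epsilon_n^2)
&\ge \Pi_n(\|\beta - \beta^0\|_2^2 \le \epsilon_n^2, \ \|\beta\|_{B^s_{p,q}} < B) \\
&\ge \lambda_J \, \Pi_n^J\Big( \sum_{j=0}^{J} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le \epsilon_n^2/2, \
      \|P_J \beta\|_{B^s_{p,q}} < B \Big) \\
&\quad \cdot \Pi_n^J\Big( \sum_{j=J+1}^{\infty} \sum_k (\beta_{jk} - \beta^0_{jk})^2 \le \epsilon_n^2/2 \Big).
\end{aligned}
\]

In the second probability above the event is actually deterministic, since
$\beta_{jk} = 0$ when $j > J$ under the prior $\Pi_n^J$, and
$\sum_{j>J,k} (\beta^0_{jk})^2 \le \epsilon_n^2/2$. So the probability
of this term is 1 and Lemma 1 is not needed.

For the first probability, the lower bound is exactly the same as above. So the
prior probability $\tilde{\Pi}_n(\|\beta - \beta^0\|_2^2 \le \epsilon_n^2)$ is bounded
below by $\lambda_J e^{-C n \epsilon_n^2}$, and $\lambda_J \propto n^{-\mu/\alpha}$, being
only polynomially small, is ignorable (it can be incorporated into the constant
$C$ in $e^{-C n \epsilon_n^2}$) in this case. □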

If we focus on bounded functions only, then we can get the rate of convergence
for the posterior mean and posterior median.

Corollary 1 Consider the case where $\|\beta^0\|_2 \le 1$ and the prior on $\beta$ is
also renormalized to put mass 1 on the set $\{\beta : \|\beta\|_2 \le 1\}$. Denote by
$\hat{\beta}$ and $\tilde{\beta}$ the posterior mean and posterior median
respectively. We have $\|\hat{\beta} - \beta^0\|_2 = O(\epsilon_n)$ and
$\|\tilde{\beta} - \beta^0\|_2 = O(\epsilon_n)$ in probability.

Proof. First note that, with slight modifications, Theorem 1 and Theorem 2
are still true when the prior is constrained to unit $\ell_2$ balls.

The result for the posterior mean is well-known (Barron et al., 1999; Ghosal et
al., 2000) since the $\ell_2$ loss is bounded under the current assumptions.

For the posterior median, since the $\ell_2$ loss is now bounded and the posterior
probability $\Pi_n(\|\beta - \beta^0\|_2 > M \epsilon_n \mid X)$ converges to zero at
least at the order $\epsilon_n^2$ (implicit in the proof of Ghosal et al. (2000),
Theorem 2.1), we have $E\|\beta - \beta^0\|_2^2 = O(\epsilon_n^2)$ in probability, where
the expectation is over the posterior distribution of $\beta$. Then we use the
simple fact that for any random variable $X$, $E[X^2] \le a^2$ implies
$|\mathrm{median}(X)| \le 2a$. This can be seen by
$P(|X| > 2a) \le E(X^2)/(4a^2) \le 1/4 < 1/2$. Now replacing $X$ by
$\beta_{jk} - \beta^0_{jk}$, and summing over $j$ and $k$, we get the convergence rate
for $\tilde{\beta}$. □
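The median bound used in the proof holds for any distribution, including the empirical distribution of a sample, which makes it easy to check numerically. The sketch below does this for a few simulated and hand-built samples; the function name `median_bound_holds` and the particular samples are assumptions of this example.

```python
import math
import random
import statistics

def median_bound_holds(sample):
    """Check |median| <= 2a, where a^2 is the mean of squares of the sample."""
    a = math.sqrt(sum(x * x for x in sample) / len(sample))
    return abs(statistics.median(sample)) <= 2.0 * a

rng = random.Random(1)
# Gaussian samples with different centres, plus a strongly skewed two-point sample
samples = [[rng.gauss(mu, 1.0) for _ in range(201)] for mu in (-3.0, 0.0, 0.5, 10.0)]
samples.append([0.0] * 50 + [100.0] * 51)
print(all(median_bound_holds(s) for s in samples))
```

The check relies only on Chebyshev's inequality, so it does not depend on the shape of the posterior.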



3 Discussion 



Using the approach of Ghosal et al. (2000) and Ghosal and van der Vaart (2007),
we have investigated the convergence rate of the posterior distribution for
the Gaussian white noise model in Besov spaces. Investigation of the posterior
distribution rather than the Bayes estimators seems to be more desirable from
a philosophical and practical point of view, since the posterior distribution
can be directly utilized to assess the uncertainty of the Bayesian inference. As
shown in Abramovich et al. (2004), their Bayes factor estimator can achieve
a better rate of convergence (although it is still not optimal within the whole
range $1 \le p < 2$). Using the prior (3) we cannot hope to achieve this rate,
since it was shown in Abramovich et al. (2004) that the posterior mean cannot
achieve this faster rate and the rate for the posterior distribution is no
faster than that of the posterior mean.



The loss function used in this investigation is the simplest $\ell_2$ loss. The
extension to more general $\ell_p$ norms is left for further research. The derived
rate is the same as in Abramovich et al. (2004) up to an extra log term and is
suboptimal in the inhomogeneous cases $1 \le p < 2$. Heavy-tailed distributions
like the double exponential are successfully used in Johnstone and Silverman
(2005) to achieve better rates, and it was argued that the implicit thresholding
in normal mixtures is too heavy on high-resolution levels. We believe optimal
rates for the posterior distribution are achievable with similar heavy-tailed
distributions.